Unzips a specified ZIP file to read a CSV file while keeping specified columns.
Usage: read_csv_from_zip(zip_filepath, csv_filename, columns_to_keep)
Splits a dataset into training and test sets based on a specified training ratio.
Usage: split_data(data, train_ratio = 0.8)
# Set a seed for reproducibility
set.seed(62380486)
# Function to split data into training and test sets
split_data <- function(data, train_ratio = 0.8) {
  # Validate that train_ratio lies in (0, 1)
  if (train_ratio <= 0 || train_ratio >= 1) {
    stop("Error: train_ratio must be between 0 and 1 (exclusive).")
  }
  # Randomly select the specified proportion of indices for the training set
  train_ind <- sample(seq_len(nrow(data)),
                      size = floor(train_ratio * nrow(data)),
                      replace = FALSE)
  # Use the remaining indices for the test set
  test_ind <- setdiff(seq_len(nrow(data)), train_ind)
  # Create the training data from the selected indices
  train_data <- data[train_ind, , drop = FALSE]
  rownames(train_data) <- NULL
  # Create the test data from the remaining indices
  test_data <- data[test_ind, , drop = FALSE]
  rownames(test_data) <- NULL
  # Return both training and test data as a named list
  return(list(train = train_data, test = test_data))
}
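The index bookkeeping inside `split_data()` can be checked in isolation on a toy size; the names here (`n`, `train_ind`, `test_ind`) simply mirror the function above.

```r
set.seed(1)
n <- 100
# Draw 80% of the indices without replacement, exactly as split_data() does
train_ind <- sample(seq_len(n), size = floor(0.8 * n), replace = FALSE)
test_ind <- setdiff(seq_len(n), train_ind)
length(train_ind)                      # 80
length(test_ind)                       # 20
length(intersect(train_ind, test_ind)) # 0: the two sets are disjoint
```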
# load data from dataset using common function read_csv_from_zip
xdata <- read_csv_from_zip(zip_filepath = "./data/data_assignment_2.zip",
                           csv_filename = "heart.csv",
                           columns_to_keep = c("DEATH", "GLUCOSE", "SYSBP"))
skim(xdata)
| Name | xdata |
| Number of rows | 11627 |
| Number of columns | 3 |
| _______________________ | |
| Column type frequency: | |
| numeric | 3 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| DEATH | 0 | 1.00 | 0.30 | 0.46 | 0.0 | 0 | 0 | 1 | 1 | ▇▁▁▁▃ |
| GLUCOSE | 1440 | 0.88 | 84.12 | 24.99 | 39.0 | 72 | 80 | 89 | 478 | ▇▁▁▁▁ |
| SYSBP | 0 | 1.00 | 136.32 | 22.80 | 83.5 | 120 | 132 | 149 | 295 | ▆▇▁▁▁ |
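The missing-data share quoted in the next paragraph follows directly from the counts in the skim output above (1440 missing out of 11627 rows):

```r
# Share of rows with GLUCOSE missing, as a percentage rounded to 2 d.p.
round(100 * 1440 / 11627, 2)  # 12.38
```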
Here we notice that GLUCOSE has 1,440 missing values, accounting for 12.38% of the total rows. We adopt a median-imputation strategy after weighing the drawbacks of the following alternatives.
Complete Case Deletion: The 12.38% missing data proportion is too high, risking significant information loss and bias.
Regression/Classification Imputation: Using other variables (like SYSBP and DEATH) to predict GLUCOSE introduces data leakage when DEATH is the target variable.
Missing Values as a Separate Category: This approach is unsuitable for numerical features like GLUCOSE as it can distort the variable’s distribution and introduce bias.
# Calculate the median of the GLUCOSE variable, ignoring NA values
glucose_median <- median(xdata$GLUCOSE, na.rm = TRUE)
# Impute the missing values in GLUCOSE with the calculated median
xdata$GLUCOSE[is.na(xdata$GLUCOSE)] <- glucose_median
# Check missing data again with skim; this time GLUCOSE's missing count should be 0
skim(xdata)
| Name | xdata |
| Number of rows | 11627 |
| Number of columns | 3 |
| _______________________ | |
| Column type frequency: | |
| numeric | 3 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| DEATH | 0 | 1 | 0.30 | 0.46 | 0.0 | 0 | 0 | 1 | 1 | ▇▁▁▁▃ |
| GLUCOSE | 0 | 1 | 83.61 | 23.43 | 39.0 | 73 | 80 | 87 | 478 | ▇▁▁▁▁ |
| SYSBP | 0 | 1 | 136.32 | 22.80 | 83.5 | 120 | 132 | 149 | 295 | ▆▇▁▁▁ |
Q(a). Split the dataset into a training set (80% of entries) and a test set (20% of entries).
sp_data <- split_data(data=xdata, train_ratio = 0.8)
sp_data
## $train
## # A tibble: 9,301 × 3
## DEATH GLUCOSE SYSBP
## <dbl> <dbl> <dbl>
## 1 0 81 137
## 2 0 76 125
## 3 0 101 147
## 4 1 87 120
## 5 0 71 154.
## 6 1 64 138
## 7 1 80 145
## 8 0 78 116
## 9 0 76 99
## 10 0 85 131
## # ℹ 9,291 more rows
##
## $test
## # A tibble: 2,326 × 3
## DEATH GLUCOSE SYSBP
## <dbl> <dbl> <dbl>
## 1 1 103 150
## 2 0 85 138
## 3 0 98 149
## 4 0 80 110.
## 5 0 79 142.
## 6 0 83 120
## 7 1 70 140
## 8 0 72 138
## 9 0 80 140.
## 10 0 91 144.
## # ℹ 2,316 more rows
Q(b). Visualise the relationship between DEATH, GLUCOSE and SYSBP in a suitable way. Form an initial hypothesis of what to look for when doing the classification.
library(plotly)
fig <- plot_ly(xdata,
               x = ~GLUCOSE, y = ~SYSBP, z = ~DEATH,
               color = ~factor(DEATH), # convert DEATH to a factor for colouring
               colors = c("blue", "red"),
               marker = list(size = 2),
               symbol = ~factor(DEATH), # map the marker symbol to DEATH as well
               alpha = 0.45,
               type = "scatter3d",
               mode = "markers",
               # Add mouse-hover text
               text = ~paste("GLUCOSE:", GLUCOSE, "<br>SYSBP:", SYSBP, "<br>DEATH:", DEATH))
fig <- fig %>% layout(
  scene = list(
    xaxis = list(title = "GLUCOSE"),
    yaxis = list(title = "SYSBP"),
    zaxis = list(title = "DEATH")
  ))
fig # show figure
We want a function of GLUCOSE and SYSBP that returns the probability of DEATH being "1" (the probability of "0" follows immediately, since the outcome is binary). This can be written as \[\Pr(G = 1 \mid X)\] where \(X\) is a vector containing the features \(x_1: GLUCOSE,\ x_2: SYSBP\), and \(G\) represents the binary output variable DEATH.
To ensure the function always returns a result between 0 and 1, we use the following sigmoid function to estimate the probability of being class 1:
\[g_1(x) = \mathbb P(G = 1 \mid X=x) = \frac{\exp(b_0 + b_1 x_1 + b_2 x_2)}{1 + \exp(b_0 + b_1 x_1 + b_2 x_2)}\] so the probability of being class 0 is \[g_0(x) = \mathbb P(G = 0 \mid X=x) = \frac{1}{1 + \exp(b_0 + b_1 x_1 + b_2 x_2)}\]
The problem now is to find the optimal combination of \(b_0\), \(b_1\) and \(b_2\) from the training set.
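The two class probabilities above can be sketched directly in R; the coefficient vector `b` and the feature values used here are illustrative, not fitted.

```r
# g1: probability of class 1 under the sigmoid model; b = (b0, b1, b2)
g1 <- function(b, x1, x2) {
  eta <- b[1] + b[2] * x1 + b[3] * x2  # linear predictor b0 + b1*x1 + b2*x2
  exp(eta) / (1 + exp(eta))            # equivalently plogis(eta)
}
# g0: probability of class 0, the complement of g1
g0 <- function(b, x1, x2) 1 - g1(b, x1, x2)

b <- c(-4.8, 0.008, 0.024)             # illustrative coefficients
g1(b, 80, 132)                         # strictly between 0 and 1
g1(b, 80, 132) + g0(b, 80, 132)        # 1: the two classes exhaust the outcome
```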
Q(c). On the training set, fit a (multiple) logistic regression model.
N.B. In this question, you are allowed to use
glm.
logreg.fit <- glm(
  formula = DEATH ~ GLUCOSE + SYSBP,
  family = binomial,
  data = sp_data$train,
  na.action = na.omit,
  model = TRUE,
  method = "glm.fit",
  x = FALSE,
  y = TRUE,
  contrasts = NULL
)
summary(logreg.fit)
##
## Call:
## glm(formula = DEATH ~ GLUCOSE + SYSBP, family = binomial, data = sp_data$train,
## na.action = na.omit, model = TRUE, method = "glm.fit", x = FALSE,
## y = TRUE, contrasts = NULL)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.832113 0.167240 -28.893 < 2e-16 ***
## GLUCOSE 0.007911 0.001067 7.413 1.23e-13 ***
## SYSBP 0.024086 0.001047 23.008 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 11408 on 9300 degrees of freedom
## Residual deviance: 10722 on 9298 degrees of freedom
## AIC: 10728
##
## Number of Fisher Scoring iterations: 4
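As a sanity check, a fitted probability can be reproduced by hand from the coefficient estimates printed above; the evaluation point (GLUCOSE = 80, SYSBP = 132, roughly the sample medians) is chosen purely for illustration.

```r
# Coefficient estimates copied from the summary output above
b0 <- -4.832113; b1 <- 0.007911; b2 <- 0.024086
eta <- b0 + b1 * 80 + b2 * 132  # linear predictor at GLUCOSE = 80, SYSBP = 132
plogis(eta)                     # fitted Pr(DEATH = 1), about 0.265
```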
# 1. Predict probabilities (using the test set)
predicted_probabilities <- predict(logreg.fit, newdata = sp_data$test, type = "response")
# 2. Set the threshold
threshold <- 0.5
# 3. Convert to binary classification
predicted_classes <- ifelse(predicted_probabilities > threshold, 1, 0)
# 4. Calculate the confusion matrix (using the test set)
confusion_matrix <- table(Actual = sp_data$test$DEATH, Predicted = predicted_classes)
# 5. Calculate the misclassification rate
misclassification_rate <- (confusion_matrix[1, 2] + confusion_matrix[2, 1]) / sum(confusion_matrix)
misclassification_rate
## [1] 0.2919175
confusion_matrix
## Predicted
## Actual 0 1
## 0 1533 83
## 1 596 114
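The confusion matrix above supports more than the raw misclassification rate; a short sketch (rows = actual, columns = predicted, counts copied from the output) derives accuracy, sensitivity and specificity.

```r
# Rebuild the confusion matrix from the printed counts
cm <- matrix(c(1533, 596, 83, 114), nrow = 2,
             dimnames = list(Actual = c(0, 1), Predicted = c(0, 1)))
accuracy    <- sum(diag(cm)) / sum(cm)       # 1 - misclassification rate
sensitivity <- cm["1", "1"] / sum(cm["1", ]) # recall on DEATH = 1
specificity <- cm["0", "0"] / sum(cm["0", ]) # recall on DEATH = 0
c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)
```

Sensitivity is only about 16% here: with the 0.5 threshold the model flags very few DEATH = 1 cases, a weakness that the overall misclassification rate of 0.29 does not reveal.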
library(plotly)
# 1. Create a grid covering the ranges of GLUCOSE and SYSBP
glucose_range <- seq(min(sp_data$train$GLUCOSE, na.rm = TRUE), max(sp_data$train$GLUCOSE, na.rm = TRUE), length.out = 50)
sysbp_range <- seq(min(sp_data$train$SYSBP, na.rm = TRUE), max(sp_data$train$SYSBP, na.rm = TRUE), length.out = 50)
grid <- expand.grid(GLUCOSE = glucose_range, SYSBP = sysbp_range)
# 2. Predict probabilities over the grid with the fitted model
grid$predicted_probability <- predict(logreg.fit, newdata = grid, type = "response")
# 3. Reshape the predictions into a matrix, transposed so that rows correspond
#    to SYSBP (the y-axis), as plotly expects for z
probability_matrix <- t(matrix(grid$predicted_probability, nrow = length(glucose_range), ncol = length(sysbp_range)))
# 4. Create a 2D contour plot of the predicted probability
fig <- plot_ly(
  x = glucose_range,
  y = sysbp_range,
  z = probability_matrix,
  type = "contour",
  contours = list(
    showlabels = TRUE,
    labelfont = list(
      size = 12,
      color = "black"
    ),
    start = 0,
    end = 1,
    size = 0.1
  )
) %>%
  layout(
    title = "2D Decision Boundary with Test Data",
    xaxis = list(title = "GLUCOSE"),
    yaxis = list(title = "SYSBP")
  )
# 5. Overlay the test-set points, coloured by DEATH, with hover text
fig <- fig %>% add_trace(
  data = sp_data$test,
  x = ~GLUCOSE,
  y = ~SYSBP,
  type = "scatter",
  mode = "markers",
  marker = list(
    size = 5,
    color = ifelse(sp_data$test$DEATH == 1, "red", "blue"),
    opacity = 0.8
  ),
  text = ~ifelse(DEATH == 1, "DEATH=1", "DEATH=0"), # hover text
  hoverinfo = "text", # show only the hover text
  name = "Test Data"
)
# 6. Add a legend (optional; the position can be adjusted)
fig <- fig %>% layout(
  showlegend = TRUE,
  legend = list(
    x = 0.85, # legend x coordinate (0-1)
    y = 0.85  # legend y coordinate (0-1)
  )
)
# 7. Display the figure
fig